IAGLR 2021 H2O Workshop (bit.ly/xxxxxxxxxx)

1 Agenda

  • 09:00 to 09:30 Set Up & Introduction
  • 09:30 to 10:30 Regression Example
  • 10:30 to 11:00 Coffee Break
  • 11:00 to 11:30 Classification Example
  • 11:30 to 12:30 Bring Your Own Data + Q&A

2 Set Up

2.1 Download -> bit.ly/useR2019_h2o_tutorial

2.2 R Packages

  • Check out setup.R
  • For this tutorial:
    • h2o for automatic and explainable machine learning
  • For RMarkdown
    • knitr for rendering this RMarkdown
    • rmdformats for readthedown RMarkdown template
    • DT for nice tables
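A minimal sketch of what a setup script like setup.R typically does (the package names are taken from the list above; the actual contents of setup.R may differ):

```r
# Install any of the required packages that are missing (hypothetical sketch)
pkgs <- c("h2o", "knitr", "rmdformats", "DT")
new_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(new_pkgs) > 0) install.packages(new_pkgs)
```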

3 Introduction

The General Data Protection Regulation (GDPR) is now in place. Are you ready to explain your models? This is a hands-on tutorial for R beginners. I will demonstrate the use of H2O and other R packages for automatic and interpretable machine learning. Participants will be able to quickly build regression and classification models with H2O’s AutoML, and then explain the model outcomes with various methods.

It is a workshop for R beginners and anyone interested in machine learning. RMarkdown and the rendered HTML will be provided so everyone can follow without running the code.

4 H2O Basics

# Let's go
library(h2o) # for H2O Machine Learning
library(DT)  # for nice tables (datatable, formatRound) and the %>% pipe

4.1 Start a local H2O Cluster (JVM)

h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 hours 37 minutes 
    H2O cluster timezone:       Europe/London 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.32.1.2 
    H2O cluster version age:    14 days  
    H2O cluster name:           H2O_started_from_R_joe_sid571 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   2.29 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.5 (2021-03-31) 
h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll()   # Optional: remove anything from previous session 
# Enter your lucky seed here ...
n_seed <- 12345

5 Data - Dom Lake Huron Abund

lake_data <- h2o.importFile('https://raw.githubusercontent.com/woobe/IAGLR_2021_H2O_Workshop/main/data/Dom_Lake_Huron_Abund.csv')
datatable(as.data.frame(head(lake_data)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)
h2o.describe(lake_data)
      Label Type Missing Zeros PosInf NegInf       Min        Max       Mean
1        C1  int       0     0      0      0    1.0000  885.00000  443.00000
2      Year  int       0     0      0      0 2006.0000 2012.00000 2009.00113
3    Region enum       0    96      0      0    0.0000    3.00000         NA
4   Station enum       0    24      0      0    0.0000  122.00000         NA
5 Replicate  int       0     1      0      0    0.0000    3.00000    1.99887
6       lat real       0     0      0      0   43.2695   46.23333   44.68487
        Sigma Cardinality
1 255.6217909          NA
2   2.3727809          NA
3          NA           4
4          NA         123
5   0.8190319          NA
6   0.8554837          NA
 [ reached 'max' / getOption("max.print") -- omitted 62 rows ]
h2o.hist(lake_data$DPOL, breaks = 100)

h2o.hist(lake_data$DBUG, breaks = 100)

5.1 Define Target and Features

target_DPOL <- "DPOL"
target_DBUG <- "DBUG"

# Remove targets, C1, and Dreisseniidae (which is DPOL + DBUG)
features <- setdiff(colnames(lake_data), c(target_DPOL, target_DBUG, "C1", "Dreisseniidae"))

print(features)
 [1] "Year"                  "Region"                "Station"              
 [4] "Replicate"             "lat"                   "lon"                  
 [7] "depth"                 "substrate"             "Season"               
[10] "Na2O"                  "Magnesium"             "MagnesiumOxide"       
[13] "AluminumOxide"         "SiliconDioxide"        "Quartz"               
[16] "PhosphorusPentoxide"   "Sulphur"               "DDT_TOTAL"            
[19] "P.P.TDE"               "P.P.DDE"               "P.P.DDE.1"            
[22] "HeptachlorEpoxide"     "Potassium"             "PotassiumOxide"       
[25] "Calcium"               "CalciumOxide"          "TitaniumDioxide"      
[28] "Chromium"              "Manganese"             "ManganeseOxide"       
[31] "XTR.Iron"              "Iron"                  "IronOxide"            
[34] "Cobalt"                "Nickel"                "Copper"               
[37] "Zinc"                  "Selenium"              "Strontium"            
[40] "Beryllium"             "Silver"                "Cadmium"              
[43] "Tin"                   "TotalCarbon"           "OrganicCarbon"        
[46] "Carbon.Nitrogen_Ratio" "TotalNitrogen"         "Mercury"              
[49] "Lead"                  "Uranium"               "Vanadium"             
[52] "Arsenic"               "Chloride"              "Fluoride"             
[55] "Sand"                  "Silt"                  "Clay"                 
[58] "Mean_grainsize"        "simpsonD"              "shannon"              
[61] "Chironomidae"          "Oligochaeta"           "Sphaeriidae"          
[64] "Diporeia"             
[Figure: machine learning workflow overview]

5.2 Split Data into Train/Test

h_split <- h2o.splitFrame(lake_data, ratios = 0.8, seed = n_seed)
h_train <- h_split[[1]] # 80% for modelling
h_test <- h_split[[2]] # 20% for evaluation
dim(h_train)
[1] 710  68
dim(h_test)
[1] 175  68

5.3 Cross-Validation

[Figure: k-fold cross-validation diagram]

6 Regression Example - Target “DPOL”

6.0.1 Baseline Models

  • h2o.glm(): H2O Generalized Linear Model
  • h2o.randomForest(): H2O Random Forest Model
  • h2o.gbm(): H2O Gradient Boosting Model
  • h2o.deeplearning(): H2O Deep Neural Network Model
  • h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model (XGBoost) from DMLC

Let’s start with GBM

model_gbm_DPOL <- h2o.gbm(x = features,               # All 64 features defined above
                              y = target_DPOL,            # DPOL (the target)
                              training_frame = h_train,   # H2O dataframe with training data
                              model_id = "baseline_gbm",  # Give the model a name
                              nfolds = 5,                 # Using 5-fold CV
                              seed = n_seed)              # Your lucky seed
# Cross-Validation
model_gbm_DPOL@model$cross_validation_metrics
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  5359.94
RMSE:  73.21161
MAE:  16.44797
RMSLE:  NaN
Mean Residual Deviance :  5359.94
# Evaluate performance on test
h2o.performance(model_gbm_DPOL, newdata = h_test)
H2ORegressionMetrics: gbm

MSE:  1357.527
RMSE:  36.84464
MAE:  10.64912
RMSLE:  NaN
Mean Residual Deviance :  1357.527

Let’s use RMSE as our main evaluation metric.

[Figure: RMSE formula]
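For reference, RMSE (root mean squared error) is defined as:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because the errors are squared before averaging, RMSE penalizes large errors more heavily than MAE does, which is why the two metrics can diverge on heavy-tailed targets like these abundance counts.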

6.0.2 Build Other Baseline Models (GLM, DRF, GBM, DNN) - TRY IT YOURSELF!

# Try other H2O models
# model_glm <- h2o.glm(x = features, y = target, ...)
# model_gbm <- h2o.gbm(x = features, y = target, ...)
# model_drf <- h2o.randomForest(x = features, y = target, ...)
# model_dnn <- h2o.deeplearning(x = features, y = target, ...)
# model_xgb <- h2o.xgboost(x = features, y = target, ...)

6.1 Manual Tuning

6.1.1 Check out the hyper-parameters for each algo

?h2o.glm 
?h2o.randomForest
?h2o.gbm
?h2o.deeplearning
?h2o.xgboost

6.1.2 Train a GBM with manual settings

model_gbm_DPOL_m <- h2o.gbm(x = features, 
                                y = target_DPOL, 
                                training_frame = h_train,
                                model_id = "model_gbm_m", 
                                nfolds = 5,
                                seed = n_seed,
                                # Manual Settings based on experience
                                learn_rate = 0.1,      # shrinkage rate (lower = more conservative)
                                ntrees = 100,          # use more trees (to compensate for the learn_rate)
                                sample_rate = 0.9,     # use a random 90% of rows for each tree
                                col_sample_rate = 0.9) # use a random 90% of features for each tree

6.1.3 Comparison (RMSE: Lower = Better)

# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("GBM: Gradient Boosting Model (Baseline)",
                               "GBM: Gradient Boosting Model (Manual Settings)"),
                     stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA
# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_gbm_DPOL@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_gbm_DPOL_m@model$cross_validation_metrics@metrics$RMSE

d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL_m, newdata = h_test))
# Show Comparison (RMSE: Lower = Better)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.2 Hierarchical GLM (HGLM) with Random Effects

# Not working yet ...
# model_hglm_DPOL <- h2o.glm(x = features,
#                            y = target_DPOL,
#                            training_frame = h_train,
#                            model_id = "model_hglm", 
#                            seed = n_seed,
#                            family = "gaussian",
#                            rand_family = c("gaussian"),
#                            rand_link = c("identity"),
#                            HGLM = TRUE,
#                            random_columns = "Year",
#                            calc_like = TRUE)

6.3 H2O AutoML

# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl_DPOL <- h2o.automl(x = features,
                         y = target_DPOL,
                         training_frame = h_train,
                         nfolds = 5,                     # 5-fold Cross-Validation
                         max_models = 20,                # Max number of models
                         stopping_metric = "RMSE",       # Metric to optimize
                         project_name = "automl_DPOL", # Specify a name so you can add more models later
                         seed = n_seed)

6.3.1 Leaderboard

datatable(as.data.frame(automl_DPOL@leaderboard), 
          rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)
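The leader is not the only model worth inspecting. Any model on the leaderboard can be retrieved by its model_id. A sketch (assuming `automl_DPOL` and `h_test` from above):

```r
# Retrieve the second-ranked model from the leaderboard and evaluate it on test
lb <- as.data.frame(automl_DPOL@leaderboard)
second_best <- h2o.getModel(lb$model_id[2])
h2o.performance(second_best, newdata = h_test)
```

This is useful when the leader is a stacked ensemble but you want a simpler, single-algorithm model for interpretation.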

6.3.2 Best Model (Leader)

automl_DPOL@leader
Model Details:
==============

H2ORegressionModel: deeplearning
Model ID:  DeepLearning_grid__3_AutoML_20210513_171638_model_2 
Status of Neuron Layers: predicting DPOL, regression, gaussian distribution, Quadratic loss, 5,101 weights/biases, 84.5 KB, 1,185,700 training samples, mini-batch size 1
  layer units             type dropout       l1       l2 mean_rate rate_rms
1     1   211            Input 10.00 %       NA       NA        NA       NA
2     2    20 RectifierDropout 10.00 % 0.000000 0.000000  0.111998 0.200869
3     3    20 RectifierDropout 10.00 % 0.000000 0.000000  0.007790 0.013263
4     4    20 RectifierDropout 10.00 % 0.000000 0.000000  0.011148 0.014293
5     5     1           Linear      NA 0.000000 0.000000  0.002238 0.001988
  momentum mean_weight weight_rms mean_bias bias_rms
1       NA          NA         NA        NA       NA
2 0.000000    0.009646   0.278547  0.439409 1.247423
3 0.000000   -0.198634   0.412209  0.579820 0.792267
4 0.000000   -0.187279   0.642276  1.320260 0.579371
5 0.000000    0.000325   0.354135  0.696467 0.000000


H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **

MSE:  2304.368
RMSE:  48.00383
MAE:  21.53961
RMSLE:  2.67551
Mean Residual Deviance :  2304.368



H2ORegressionMetrics: deeplearning
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  4209.012
RMSE:  64.8769
MAE:  19.68972
RMSLE:  NaN
Mean Residual Deviance :  4209.012


Cross-Validation Metrics Summary: 
                            mean         sd cv_1_valid cv_2_valid cv_3_valid
mae                    19.689724  4.1271563  17.920628  15.404459  18.325184
mean_residual_deviance 4209.0127   1434.702  3613.0261  2488.5093  3668.7349
mse                    4209.0127   1434.702  3613.0261  2488.5093  3668.7349
r2                     0.5286015 0.17153801  0.5485696 0.49927586 0.72438794
residual_deviance      4209.0127   1434.702  3613.0261  2488.5093  3668.7349
rmse                    64.11342  11.095151   60.10845   49.88496  60.570084
rmsle                  2.0851088        0.0  2.0851088        NaN        NaN
                       cv_4_valid cv_5_valid
mae                     20.461622   26.33673
mean_residual_deviance  6159.7417   5115.051
mse                     6159.7417   5115.051
r2                     0.60964584 0.26112828
residual_deviance       6159.7417   5115.051
rmse                    78.484024  71.519585
rmsle                         NaN        NaN

6.3.3 Comparison (RMSE: Lower = Better)

d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
                         RMSE_cv = automl_DPOL@leader@model$cross_validation_metrics@metrics$RMSE,
                         RMSE_test = h2o.rmse(h2o.performance(automl_DPOL@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)

datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

6.4 Make Predictions

yhat_test <- h2o.predict(automl_DPOL@leader, newdata = h_test)
head(yhat_test)
   predict
1 19.14614
2 19.12452
3 13.77955
4 14.24310
5 13.70907
6 13.81577
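The test RMSE reported by `h2o.performance()` can be reproduced by hand from these predictions. A sketch (assuming `yhat_test`, `h_test`, and `target_DPOL` from above):

```r
# Reproduce the test RMSE manually: sqrt of the mean squared residual
d_pred   <- as.data.frame(yhat_test)
d_actual <- as.data.frame(h_test[, target_DPOL])
rmse_manual <- sqrt(mean((d_actual$DPOL - d_pred$predict)^2))
print(rmse_manual)
# Should match h2o.rmse(h2o.performance(automl_DPOL@leader, newdata = h_test))
```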

6.5 Explainable AI - Target “DPOL”

6.5.1 Global Explanations

# Explain Manually Tuned GBM
h2o.explain(model_gbm_DPOL_m, newdata = h_test)


Residual Analysis
=================

> Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.



Variable Importance
===================

> The variable importance plot shows the relative importance of the most important variables in the model.



SHAP Summary
============

> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.



Partial Dependence Plots
========================

> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.



Individual Conditional Expectations
===================================

> An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.

6.5.2 Local Explanations
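Whereas `h2o.explain()` produces global explanations for the whole test set, per-row explanations are available via `h2o.explain_row()`. A sketch (row_index is 1-based; pick any test row of interest):

```r
# Local explanation for a single test instance
h2o.explain_row(model_gbm_DPOL_m, newdata = h_test, row_index = 1)
```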

6.5.3 Local Contributions (for Tree-based Models)

predictions <- h2o.predict(model_gbm_DPOL_m, newdata = h_test)
as.data.frame(head(predictions, 5))
    predict
1 33.233473
2  7.628626
3 13.034093
4  6.006849
5 14.779455
contributions <- h2o.predict_contributions(model_gbm_DPOL_m, newdata = h_test)

datatable(as.data.frame(head(contributions)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)
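A useful sanity check: the per-feature contributions plus the `BiasTerm` column should sum to the raw model prediction, and for a Gaussian GBM the raw prediction is the predicted value itself. A sketch (assuming `contributions` and `predictions` from above):

```r
# Row sums of contributions (including BiasTerm) should match the predictions
d_contrib <- as.data.frame(head(contributions, 5))
rowSums(d_contrib)         # compare against ...
as.data.frame(head(predictions, 5))
```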

7 Regression Example - Target “DBUG”

T.B.A.

8 Bring Your Own Data + Q&A

Get your hands dirty!